Exploratory Data Analysis¶
Exploratory data analysis (EDA) involves taking a first look at a dataset and summarising its salient characteristics using tables and graphics. It is (or should be) the stage before testing hypotheses and can be useful in informing hypotheses. In this chapter, we’ll look at a few options for EDA using code.
To show how to do EDA using code, we will need a dataset to explore. We’ll use the Grinell housing dataset, which covers houses sold between 2005 and 2015 in Grinnell, Iowa. It was assembled by a local estate agent, Matt Karjalahti.
First, let’s import a couple of essential packages:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
from rich import print
# Plot settings
plt.style.use(
"https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
warnings.filterwarnings('ignore')
Preliminaries: extracting, typing, and initial cleaning¶
Let’s grab some data to do exploratory data analysis on.
df = pd.read_csv(
"https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/GrinnellHouses.csv",
index_col=0,
)
df.head()
| Date | Address | Bedrooms | Baths | SquareFeet | LotSize | YearBuilt | YearSold | MonthSold | DaySold | CostPerSqFt | OrigPrice | ListPrice | SalePrice | SPLPPct | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16695 | 1510 First Ave #112 | 2 | 1.0 | 1120.0 | NaN | 1993 | 2005 | 9 | 16 | 6.25 | 17000 | 10500 | 7000 | 66.67 |
| 2 | 16880 | 1020 Center St | 3 | 1.0 | 1224.0 | 0.172176 | 1900 | 2006 | 3 | 20 | 22.06 | 35000 | 35000 | 27000 | 77.14 |
| 3 | 16875 | 918 Chatterton St | 4 | 1.0 | 1540.0 | NaN | 1970 | 2006 | 3 | 15 | 18.18 | 54000 | 47000 | 28000 | 59.57 |
| 4 | 16833 | 1023 & 1025 Spring St. | 3 | 1.0 | 1154.0 | NaN | 1900 | 2006 | 2 | 1 | 26.00 | 65000 | 49000 | 30000 | 61.22 |
| 5 | 16667 | 503 2nd Ave | 3 | 1.0 | 1277.0 | 0.206612 | 1900 | 2005 | 8 | 19 | 24.08 | 35000 | 35000 | 30750 | 87.86 |
There are some things we should do even before an exploratory data analysis; we should make sure we’ve cast the columns to the right types and converted any that need converting. Even a quick look at the data provided by .head() shows that the Date column is in a weird format. It also seems like the number of Baths column has some data issues because it is a floating point number rather than an integer. However, this a North American quirk: realtors refer to bathrooms with only a toilet and sink as a “half bath” (and sometimes those with a shower but no bathtub as a “three-quarter bath,” which also shows up in the data). So this is okay, and we mainly need concern ourselves with the Date column.
Fortunately, there is guidance attached to the dataset as to what the Date column really means: it’s just a counter of days that encodes the date of sale that begins with 1st Jan 2005=16436. The documents also tell us that the final year is 2015. We’d really like to have a more usable datetime column than this so let’s sort this out first. Ideally, we’d like to map the Date column into a new datetime column that turns 16436 into 1st Jan 2005, 16437 into 2nd Jan 2005, and so on, until we have a mapping for every date between the first date in the dataset and the last.
Our solution will be to create a function that maps the code for Date into a genuine datetime using the pd.DateOffset function and apply it to the Date column using apply:
def convert_date_code_to_datetime(date):
start_code = 16436
return pd.to_datetime("01-01-2005") + pd.DateOffset(days=date - start_code)
df["datetime"] = df["Date"].apply(convert_date_code_to_datetime)
# Check the 'biggest' rows by datetime
df.nlargest(5, columns=["datetime"])
| Date | Address | Bedrooms | Baths | SquareFeet | LotSize | YearBuilt | YearSold | MonthSold | DaySold | CostPerSqFt | OrigPrice | ListPrice | SalePrice | SPLPPct | datetime | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 915 | 20186 | 957 350th Ave | 3 | 2.00 | 1700.0 | 11.430000 | 1966 | 2015 | 4 | 8 | 128.82 | 229000 | 229000 | 219000 | 95.63 | 2015-04-09 |
| 886 | 20185 | 232 4th Ave W | 3 | 2.50 | NaN | 0.139991 | 2004 | 2015 | 4 | 7 | 0.00 | 137500 | 137500 | 128500 | 93.45 | 2015-04-08 |
| 905 | 20185 | 1726 Summer St | 4 | 1.75 | 1512.0 | 0.530000 | 1996 | 2015 | 4 | 7 | 125.00 | 239900 | 219900 | 189000 | 85.95 | 2015-04-08 |
| 920 | 20184 | 108 East St | 4 | 3.00 | 1650.0 | 0.320000 | 2013 | 2015 | 4 | 6 | 156.97 | 259000 | 259000 | 259000 | 100.00 | 2015-04-07 |
| 834 | 20181 | 1405 Prince St | 3 | 1.50 | 1094.0 | NaN | 1900 | 2015 | 4 | 3 | 18.28 | 32500 | 32500 | 20000 | 61.54 | 2015-04-04 |
An extra column containing datetime has been added and, looking at the max rows, we can see that it does indeed run all the way to 2015 as expected from the documentation of the dataset.
Okay, now we want to sort out the data type isses we saw earlier. But let’s just check they’re as bad as we think using df.info()
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 929 entries, 1 to 929
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 929 non-null int64
1 Address 929 non-null object
2 Bedrooms 929 non-null int64
3 Baths 929 non-null float64
4 SquareFeet 911 non-null float64
5 LotSize 741 non-null float64
6 YearBuilt 929 non-null int64
7 YearSold 929 non-null int64
8 MonthSold 929 non-null int64
9 DaySold 929 non-null int64
10 CostPerSqFt 929 non-null float64
11 OrigPrice 929 non-null int64
12 ListPrice 929 non-null int64
13 SalePrice 929 non-null int64
14 SPLPPct 929 non-null float64
15 datetime 929 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(5), int64(9), object(1)
memory usage: 123.4+ KB
pandas read most of them in sensibly, but not all of them, so let’s adjust the ones we need to:
df = df.assign(
Address=df["Address"].astype("string"),
Bedrooms=df["Bedrooms"].astype("category"),
OrigPrice=df["OrigPrice"].astype(float),
ListPrice=df["ListPrice"].astype(float),
SalePrice=df["SalePrice"].astype(float),
)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 929 entries, 1 to 929
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 929 non-null int64
1 Address 929 non-null string
2 Bedrooms 929 non-null category
3 Baths 929 non-null float64
4 SquareFeet 911 non-null float64
5 LotSize 741 non-null float64
6 YearBuilt 929 non-null int64
7 YearSold 929 non-null int64
8 MonthSold 929 non-null int64
9 DaySold 929 non-null int64
10 CostPerSqFt 929 non-null float64
11 OrigPrice 929 non-null float64
12 ListPrice 929 non-null float64
13 SalePrice 929 non-null float64
14 SPLPPct 929 non-null float64
15 datetime 929 non-null datetime64[ns]
dtypes: category(1), datetime64[ns](1), float64(8), int64(5), string(1)
memory usage: 117.4 KB
Having to do some initial variable type cleaning is a normal and unavoidable part of data analysis, especially when reading in from a format like CSV (which does not preserve data type but has great interoperability across systems). It’s important and what we’ve seen here is a typical pattern.
There’s just one more step before starting the EDA proper. If you haven’t reached the Chapter on best practice, which talks about naming conventions, you won’t know this yet but the usual naming convention for variables and columns in Python is so-called snake case (appropriately enough). An example of a word not in snake case would be ‘ThisIsAColName’; actually that’s in Camel Case. Rendered in snake case, it would be ‘this_is_a_col_name’, which is longer but slightly easier to read. So we should really change the column names to be in snake case. Fortunately, the dataprep package has us covered here; it provides methods to translate between pretty much every naming convention that exists. Let’s change our columns to snake case, which is the default setting of clean_headers.
from dataprep.clean import clean_headers
df = clean_headers(df, case="snake")
print(df.columns)
Column Headers Cleaning Report:
15 values cleaned (93.75%)
Index(['date', 'address', 'bedrooms', 'baths', 'square_feet', 'lot_size', 'year_built', 'year_sold', 'month_sold', 'day_sold', 'cost_per_sq_ft', 'orig_price', 'list_price', 'sale_price', 'splp_pct', 'datetime'], dtype='object')
Now we can start the exploratory data analysis!
EDA using pandas built-in methods¶
pandas has some great options for built-in EDA; in fact we’ve already seen one of them, df.info() which, as well as reporting datatypes and memory usage, also tells us how many observations in each column are ‘truthy’ rather than ‘falsy’, ie how many have non-null values.
Exploratory tables and descriptive statistics¶
A small step beyond df.info to get tables is to use df.describe() which, if you have mixed datatypes that include floats, will report some basic summary statistics:
df.describe()
| date | baths | square_feet | lot_size | year_built | year_sold | month_sold | day_sold | cost_per_sq_ft | orig_price | list_price | sale_price | splp_pct | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 929.000000 | 929.000000 | 911.000000 | 741.000000 | 929.000000 | 929.000000 | 929.000000 | 929.000000 | 929.000000 | 929.000000 | 929.000000 | 929.000000 | 929.000000 |
| mean | 18223.215285 | 1.779333 | 1583.215148 | 0.723463 | 1945.903122 | 2009.365985 | 6.831001 | 16.226050 | 83.291561 | 146047.083961 | 141399.138859 | 133203.720129 | 93.563994 |
| std | 1139.171002 | 0.743154 | 655.664966 | 2.725697 | 37.209902 | 3.124149 | 3.035790 | 9.396815 | 35.762241 | 85174.109372 | 82228.201763 | 77959.624697 | 6.953764 |
| min | 16527.000000 | 0.000000 | 640.000000 | 0.028926 | 1870.000000 | 2005.000000 | 1.000000 | 1.000000 | 0.000000 | 5990.000000 | 10500.000000 | 7000.000000 | 46.500000 |
| 25% | 17169.000000 | 1.000000 | 1150.000000 | 0.233884 | 1900.000000 | 2007.000000 | 5.000000 | 8.000000 | 60.820000 | 89900.000000 | 88000.000000 | 83000.000000 | 91.320000 |
| 50% | 18046.000000 | 1.750000 | 1440.000000 | 0.284091 | 1956.000000 | 2009.000000 | 7.000000 | 16.000000 | 82.880000 | 129900.000000 | 125000.000000 | 119340.000000 | 94.920000 |
| 75% | 19281.000000 | 2.000000 | 1833.000000 | 0.370179 | 1973.000000 | 2012.000000 | 9.000000 | 25.000000 | 105.770000 | 179000.000000 | 172000.000000 | 162500.000000 | 97.560000 |
| max | 20186.000000 | 6.000000 | 6815.000000 | 55.000000 | 2013.000000 | 2015.000000 | 12.000000 | 31.000000 | 258.790000 | 695000.000000 | 695000.000000 | 606000.000000 | 111.020000 |
Although helpful, that sure is hard to read! We can improve this by using the round() method too:
sum_table = df.describe().round(1)
sum_table
| date | baths | square_feet | lot_size | year_built | year_sold | month_sold | day_sold | cost_per_sq_ft | orig_price | list_price | sale_price | splp_pct | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 929.0 | 929.0 | 911.0 | 741.0 | 929.0 | 929.0 | 929.0 | 929.0 | 929.0 | 929.0 | 929.0 | 929.0 | 929.0 |
| mean | 18223.2 | 1.8 | 1583.2 | 0.7 | 1945.9 | 2009.4 | 6.8 | 16.2 | 83.3 | 146047.1 | 141399.1 | 133203.7 | 93.6 |
| std | 1139.2 | 0.7 | 655.7 | 2.7 | 37.2 | 3.1 | 3.0 | 9.4 | 35.8 | 85174.1 | 82228.2 | 77959.6 | 7.0 |
| min | 16527.0 | 0.0 | 640.0 | 0.0 | 1870.0 | 2005.0 | 1.0 | 1.0 | 0.0 | 5990.0 | 10500.0 | 7000.0 | 46.5 |
| 25% | 17169.0 | 1.0 | 1150.0 | 0.2 | 1900.0 | 2007.0 | 5.0 | 8.0 | 60.8 | 89900.0 | 88000.0 | 83000.0 | 91.3 |
| 50% | 18046.0 | 1.8 | 1440.0 | 0.3 | 1956.0 | 2009.0 | 7.0 | 16.0 | 82.9 | 129900.0 | 125000.0 | 119340.0 | 94.9 |
| 75% | 19281.0 | 2.0 | 1833.0 | 0.4 | 1973.0 | 2012.0 | 9.0 | 25.0 | 105.8 | 179000.0 | 172000.0 | 162500.0 | 97.6 |
| max | 20186.0 | 6.0 | 6815.0 | 55.0 | 2013.0 | 2015.0 | 12.0 | 31.0 | 258.8 | 695000.0 | 695000.0 | 606000.0 | 111.0 |
Published summary statistics tables often list one variable per row, and if your dataframe has many variables, describe() can quickly get too wide to read easily. You can transpose it using the T property (or the transpose() method):
sum_table = sum_table.T
sum_table
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| date | 929.0 | 18223.2 | 1139.2 | 16527.0 | 17169.0 | 18046.0 | 19281.0 | 20186.0 |
| baths | 929.0 | 1.8 | 0.7 | 0.0 | 1.0 | 1.8 | 2.0 | 6.0 |
| square_feet | 911.0 | 1583.2 | 655.7 | 640.0 | 1150.0 | 1440.0 | 1833.0 | 6815.0 |
| lot_size | 741.0 | 0.7 | 2.7 | 0.0 | 0.2 | 0.3 | 0.4 | 55.0 |
| year_built | 929.0 | 1945.9 | 37.2 | 1870.0 | 1900.0 | 1956.0 | 1973.0 | 2013.0 |
| year_sold | 929.0 | 2009.4 | 3.1 | 2005.0 | 2007.0 | 2009.0 | 2012.0 | 2015.0 |
| month_sold | 929.0 | 6.8 | 3.0 | 1.0 | 5.0 | 7.0 | 9.0 | 12.0 |
| day_sold | 929.0 | 16.2 | 9.4 | 1.0 | 8.0 | 16.0 | 25.0 | 31.0 |
| cost_per_sq_ft | 929.0 | 83.3 | 35.8 | 0.0 | 60.8 | 82.9 | 105.8 | 258.8 |
| orig_price | 929.0 | 146047.1 | 85174.1 | 5990.0 | 89900.0 | 129900.0 | 179000.0 | 695000.0 |
| list_price | 929.0 | 141399.1 | 82228.2 | 10500.0 | 88000.0 | 125000.0 | 172000.0 | 695000.0 |
| sale_price | 929.0 | 133203.7 | 77959.6 | 7000.0 | 83000.0 | 119340.0 | 162500.0 | 606000.0 |
| splp_pct | 929.0 | 93.6 | 7.0 | 46.5 | 91.3 | 94.9 | 97.6 | 111.0 |
Now, although this is very basic, let’s say you wanted to include it as a table of summary statistics in a paper. This is just a dataframe so you can export it just like you can any pandas dataframe. This includes options like .to_excel. For inclusion in a paper or report, you’re most likely to want to report it as text:
print(sum_table.to_string())
count mean std min 25% 50% 75% max date 929.0 18223.2 1139.2 16527.0 17169.0 18046.0 19281.0 20186.0 baths 929.0 1.8 0.7 0.0 1.0 1.8 2.0 6.0 square_feet 911.0 1583.2 655.7 640.0 1150.0 1440.0 1833.0 6815.0 lot_size 741.0 0.7 2.7 0.0 0.2 0.3 0.4 55.0 year_built 929.0 1945.9 37.2 1870.0 1900.0 1956.0 1973.0 2013.0 year_sold 929.0 2009.4 3.1 2005.0 2007.0 2009.0 2012.0 2015.0 month_sold 929.0 6.8 3.0 1.0 5.0 7.0 9.0 12.0 day_sold 929.0 16.2 9.4 1.0 8.0 16.0 25.0 31.0 cost_per_sq_ft 929.0 83.3 35.8 0.0 60.8 82.9 105.8 258.8 orig_price 929.0 146047.1 85174.1 5990.0 89900.0 129900.0 179000.0 695000.0 list_price 929.0 141399.1 82228.2 10500.0 88000.0 125000.0 172000.0 695000.0 sale_price 929.0 133203.7 77959.6 7000.0 83000.0 119340.0 162500.0 606000.0 splp_pct 929.0 93.6 7.0 46.5 91.3 94.9 97.6 111.0
as Markdown:
print(sum_table.to_markdown())
| | count | mean | std | min | 25% | 50% | 75% | max | |:---------------|--------:|---------:|--------:|--------:|--------:|---------:|---------:|-- -------:| | date | 929 | 18223.2 | 1139.2 | 16527 | 17169 | 18046 | 19281 | 20186 | | baths | 929 | 1.8 | 0.7 | 0 | 1 | 1.8 | 2 | 6 | | square_feet | 911 | 1583.2 | 655.7 | 640 | 1150 | 1440 | 1833 | 6815 | | lot_size | 741 | 0.7 | 2.7 | 0 | 0.2 | 0.3 | 0.4 | 55 | | year_built | 929 | 1945.9 | 37.2 | 1870 | 1900 | 1956 | 1973 | 2013 | | year_sold | 929 | 2009.4 | 3.1 | 2005 | 2007 | 2009 | 2012 | 2015 | | month_sold | 929 | 6.8 | 3 | 1 | 5 | 7 | 9 | 12 | | day_sold | 929 | 16.2 | 9.4 | 1 | 8 | 16 | 25 | 31 | | cost_per_sq_ft | 929 | 83.3 | 35.8 | 0 | 60.8 | 82.9 | 105.8 | 258.8 | | orig_price | 929 | 146047 | 85174.1 | 5990 | 89900 | 129900 | 179000 | 695000 | | list_price | 929 | 141399 | 82228.2 | 10500 | 88000 | 125000 | 172000 | 695000 | | sale_price | 929 | 133204 | 77959.6 | 7000 | 83000 | 119340 | 162500 | 606000 | | splp_pct | 929 | 93.6 | 7 | 46.5 | 91.3 | 94.9 | 97.6 | 111 |
as HTML:
print(sum_table.head(3).to_html())
<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>count</th> <th>mean</th> <th>std</th> <th>min</th> <th>25%</th> <th>50%</th> <th>75%</th> <th>max</th> </tr> </thead> <tbody> <tr> <th>date</th> <td>929.0</td> <td>18223.2</td> <td>1139.2</td> <td>16527.0</td> <td>17169.0</td> <td>18046.0</td> <td>19281.0</td> <td>20186.0</td> </tr> <tr> <th>baths</th> <td>929.0</td> <td>1.8</td> <td>0.7</td> <td>0.0</td> <td>1.0</td> <td>1.8</td> <td>2.0</td> <td>6.0</td> </tr> <tr> <th>square_feet</th> <td>911.0</td> <td>1583.2</td> <td>655.7</td> <td>640.0</td> <td>1150.0</td> <td>1440.0</td> <td>1833.0</td> <td>6815.0</td> </tr> </tbody> </table>
or as latex:
print(sum_table.to_latex(caption="Summary stats from EDA"))
\begin{table} \centering \caption{Summary stats from EDA} \begin{tabular}{lrrrrrrrr} \toprule {} & count & mean & std & min & 25\% & 50\% & 75\% & max \\ \midrule date & 929.0 & 18223.2 & 1139.2 & 16527.0 & 17169.0 & 18046.0 & 19281.0 & 20186.0 \\ baths & 929.0 & 1.8 & 0.7 & 0.0 & 1.0 & 1.8 & 2.0 & 6.0 \\ square\_feet & 911.0 & 1583.2 & 655.7 & 640.0 & 1150.0 & 1440.0 & 1833.0 & 6815.0 \\ lot\_size & 741.0 & 0.7 & 2.7 & 0.0 & 0.2 & 0.3 & 0.4 & 55.0 \\ year\_built & 929.0 & 1945.9 & 37.2 & 1870.0 & 1900.0 & 1956.0 & 1973.0 & 2013.0 \\ year\_sold & 929.0 & 2009.4 & 3.1 & 2005.0 & 2007.0 & 2009.0 & 2012.0 & 2015.0 \\ month\_sold & 929.0 & 6.8 & 3.0 & 1.0 & 5.0 & 7.0 & 9.0 & 12.0 \\ day\_sold & 929.0 & 16.2 & 9.4 & 1.0 & 8.0 & 16.0 & 25.0 & 31.0 \\ cost\_per\_sq\_ft & 929.0 & 83.3 & 35.8 & 0.0 & 60.8 & 82.9 & 105.8 & 258.8 \\ orig\_price & 929.0 & 146047.1 & 85174.1 & 5990.0 & 89900.0 & 129900.0 & 179000.0 & 695000.0 \\ list\_price & 929.0 & 141399.1 & 82228.2 & 10500.0 & 88000.0 & 125000.0 & 172000.0 & 695000.0 \\ sale\_price & 929.0 & 133203.7 & 77959.6 & 7000.0 & 83000.0 & 119340.0 & 162500.0 & 606000.0 \\ splp\_pct & 929.0 & 93.6 & 7.0 & 46.5 & 91.3 & 94.9 & 97.6 & 111.0 \\ \bottomrule \end{tabular} \end{table}
And remember, with all of these, you can write them to file using the following structure:
open('sum_stats_table.tex', 'w').write(sum_table.to_latex(caption='Summary stats from EDA'))
Of course, the stats provided in this pre-built table are not very customised. So what do we do to get the table that we actually want? Well, the answer is to draw on the contents of the previous data chapters, particularly the introduction to data analysis. Groupbys, merges, aggregations: use all of them to produce the EDA table that you want.
If you’re exploring data, you might also want to be able to read everything clearly and see any deviations from what you’d expect quickly. pandas has some built-in functionality that styles dataframes to help you. These styles persist when you export the dataframe to, say, Excel, too.
Here’s an example that highlights some ways of styling dataframes, making use of several features such as: unstacking into a wider format (unstack), changing the units (lambda function), fill NaNs with unobtrusive strings (.fillna('-')), removing numbers after the decimal place (.style.format(precision=0)), and adding a caption (.style.set_caption).
(
df.groupby(["year_sold", "bedrooms"])
.mean()["sale_price"]
.unstack()
.apply(lambda x: x / 1e3)
.fillna("-")
.style.format(precision=0)
.set_caption("Sale price (thousands)")
)
| bedrooms | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| year_sold | |||||||||
| 2005 | 155 | - | 82 | 112 | 141 | 174 | 205 | 81 | - |
| 2006 | 25 | - | 80 | 125 | 139 | 281 | - | - | 205 |
| 2007 | - | - | 83 | 110 | 171 | 245 | 400 | - | - |
| 2008 | 238 | 62 | 82 | 132 | 158 | 236 | 221 | - | - |
| 2009 | - | 84 | 78 | 104 | 129 | 159 | - | - | - |
| 2010 | - | - | 68 | 120 | 161 | 195 | - | - | - |
| 2011 | 128 | - | 83 | 128 | 176 | 213 | - | - | - |
| 2012 | 191 | 38 | 98 | 132 | 154 | 213 | 305 | - | - |
| 2013 | 382 | 88 | 77 | 129 | 197 | 182 | 141 | - | - |
| 2014 | 55 | 65 | 94 | 147 | 157 | 198 | - | - | - |
| 2015 | - | - | 93 | 126 | 213 | - | - | - | - |
(Another way to achieve the groupby, mean, and unstack steps in one step is using df.pivot_table(index="YearSold", columns="Bedrooms", values="SalePrice", aggfunc=np.mean.)
Although a neater one than we’ve seen, this is still a drab table of numbers. The eye is not immediately drawn to it!
To remedy that, let’s take a look at another styling technique: the use of colour. Let’s say we wanted to make a table that showed a cross-tabulation between year and number of bathrooms. Naturally, we’ll use pd.crosstab but we’ll ask that the values that appear in the table (counts) be lit up with a heatmap:
pd.crosstab(df["bedrooms"], df["year_sold"]).style.background_gradient(cmap="plasma")
| year_sold | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bedrooms | |||||||||||
| 0 | 1 | 1 | 0 | 1 | 0 | 0 | 3 | 3 | 1 | 2 | 0 |
| 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 2 | 1 | 0 |
| 2 | 15 | 18 | 24 | 17 | 14 | 10 | 10 | 11 | 18 | 11 | 1 |
| 3 | 54 | 64 | 56 | 59 | 25 | 27 | 31 | 47 | 39 | 46 | 15 |
| 4 | 21 | 39 | 24 | 17 | 10 | 14 | 14 | 28 | 28 | 30 | 3 |
| 5 | 7 | 8 | 6 | 8 | 3 | 2 | 8 | 6 | 4 | 9 | 0 |
| 6 | 1 | 0 | 1 | 2 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
By default, background_gradient highlights each number relative to the others in its column; you can highlight by row using axis=1 or relative to all table values using axis=0. And of course plasma is just one of many available colormaps!
Here are a couple of other styling tips for dataframes.
First, use bars to show ordering:
(
df.iloc[:10, -6:-1]
.style.format(precision=0)
.bar(subset=["cost_per_sq_ft", "sale_price"], color="#d65f5f")
)
| cost_per_sq_ft | orig_price | list_price | sale_price | splp_pct | |
|---|---|---|---|---|---|
| 1 | 6 | 17000 | 10500 | 7000 | 67 |
| 2 | 22 | 35000 | 35000 | 27000 | 77 |
| 3 | 18 | 54000 | 47000 | 28000 | 60 |
| 4 | 26 | 65000 | 49000 | 30000 | 61 |
| 5 | 24 | 35000 | 35000 | 30750 | 88 |
| 6 | 39 | 45900 | 45900 | 42000 | 92 |
| 7 | 60 | 62500 | 56500 | 46000 | 81 |
| 8 | 43 | 54500 | 52500 | 49500 | 94 |
| 9 | 55 | 59900 | 52500 | 50000 | 95 |
| 10 | 34 | 50000 | 50000 | 50000 | 100 |
Second, use format to add informative suffixes and round numbers appropriately:
df.groupby(["bedrooms"])[["splp_pct"]].mean().style.format("{:.0f}%")
| splp_pct | |
|---|---|
| bedrooms | |
| 0 | 97% |
| 1 | 89% |
| 2 | 92% |
| 3 | 94% |
| 4 | 94% |
| 5 | 93% |
| 6 | 90% |
| 7 | 95% |
| 8 | 82% |
Third, use .hightlight_max, and similar commands, to show important entries:
df.iloc[:10, -6:-1].style.highlight_max().format("{:.0f}")
| cost_per_sq_ft | orig_price | list_price | sale_price | splp_pct | |
|---|---|---|---|---|---|
| 1 | 6 | 17000 | 10500 | 7000 | 67 |
| 2 | 22 | 35000 | 35000 | 27000 | 77 |
| 3 | 18 | 54000 | 47000 | 28000 | 60 |
| 4 | 26 | 65000 | 49000 | 30000 | 61 |
| 5 | 24 | 35000 | 35000 | 30750 | 88 |
| 6 | 39 | 45900 | 45900 | 42000 | 92 |
| 7 | 60 | 62500 | 56500 | 46000 | 81 |
| 8 | 43 | 54500 | 52500 | 49500 | 94 |
| 9 | 55 | 59900 | 52500 | 50000 | 95 |
| 10 | 34 | 50000 | 50000 | 50000 | 100 |
You can find a full set of styling commands here.
Exploratory plotting¶
pandas has some built-in plotting options to help you look at data quickly. These can be accessed via .plot.* or .plot(), depending on the context. Let’s make a quick .plot() of the mean quarterly nominal Sale Price over time (with extra options passed via keyword arguments):
(
df.set_index("datetime")
.groupby(pd.Grouper(freq="3M"))["sale_price"]
.mean()
.apply(lambda x: x / 1e3)
.plot(
title="House sales in Grinnell, Iowa",
xlabel="",
ylabel="Mean nominal sale price (000s USD)",
)
);
Again, if you can get the data in the right shape, you can plot it. The same function works with multiple lines
(
df.set_index("datetime")
.groupby(pd.Grouper(freq="3M"))[["orig_price", "list_price", "sale_price"]]
.mean()
.apply(lambda x: x / 1e3)
.plot(style=["-", ":", "-."])
);
Now let’s see some of the other quick .plot.* options.
Here’s a kernel density estimation (KDE) plot. Just to show what’s possible, we’ll use the df.columnname syntax, an alternative to df['columnname'], and set limits via keyword arguments.
df.square_feet.plot.kde(ylim=(0, None), xlim=(0, None));
A bar chart (use barh for horizontal orientation; rot sets rotation of labels):
df.value_counts("bedrooms").sort_index().plot.bar(title="Counts", rot=0);
This chart shows why EDA is so important: if we are modelling residential housing, and we find that there are observations that don’t have any bedrooms, perhaps they should be excluded from the downstream analysis.
This next one, using .plot.hist, provides another instructive example of how EDA can reveal interesting issues with datasets. First, there are many more houses built in 1900 than seem plausible and, checking the documentation for the dataset, it does say that when a construction date is not known, it is entered as 1900-which explains what’s going on here. Second, some of the construction years are in the future! Is this plausible? It could be. Some houses are sold before the construction of an estate is completed; this could be the case here.
df["year_built"].plot.hist(bins=30, title="Year of construction");
Boxplot:
(df[["orig_price", "list_price", "sale_price"]].plot.box());
df["class_ln_price"] = pd.cut(np.log(df["orig_price"]), bins=4, precision=0)
(
df.set_index("datetime")
.groupby([pd.Grouper(freq="A"), "class_ln_price"])["class_ln_price"]
.count()
.unstack()
.plot.area(ylabel="Count")
);
Scatter plot:
df.plot.scatter(x="square_feet", y="sale_price", alpha=0.7, ylim=(0, None));
Hexbin:
df.plot.hexbin(
y="list_price",
x="splp_pct",
gridsize=10,
cmap="inferno_r",
C="cost_per_sq_ft",
sharex=False,
);
skimpy for summary statistics¶
The skimpy package is a light weight tool that provides summary statistics about variables in data frames in the console (rather than in a big HTML report, which is what the other EDA packages in the rest of this chapter too). Sometimes df.summary() isn’t enough but a HTML report is too much, and skimpy fills this gap. Let’s see it in action.
from skimpy import skim
skim(df)
╭───────────────────────────────────── skimpy summary ──────────────────────────────────────╮ │ Data Summary Data Types Categories │ │ ┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┓ ┏━━━━━━━━━━━━━┳━━━━━━━┓ ┏━━━━━━━━━━━━━━━━━━━━━━━┓ │ │ ┃ dataframe ┃ Values ┃ ┃ Column Type ┃ Count ┃ ┃ Categorical Variables ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━┩ ┡━━━━━━━━━━━━━╇━━━━━━━┩ ┡━━━━━━━━━━━━━━━━━━━━━━━┩ │ │ │ Number of rows │ 929 │ │ float64 │ 8 │ │ bedrooms │ │ │ │ Number of columns │ 17 │ │ int64 │ 5 │ │ class_ln_price │ │ │ └───────────────────┴────────┘ │ category │ 2 │ └───────────────────────┘ │ │ │ string │ 1 │ │ │ │ datetime64 │ 1 │ │ │ └─────────────┴───────┘ │ │ number │ │ ┏━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━┳━━━━━━━┓ │ │ ┃ ┃ missin ┃ comple ┃ mean ┃ sd ┃ p0 ┃ p25 ┃ p75 ┃ p100 ┃ hist ┃ │ │ ┃ ┃ g ┃ te ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ │ │ ┃ ┃ ┃ rate ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ │ │ ┡━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━╇━━━━━━━┩ │ │ │ date │ 0.0 │ 1.0 │ 18000. │ 1100.0 │ 17000 │ 17000. │ 19000 │ 20000. │ █▇▃▄▅ │ │ │ │ │ │ │ 0 │ │ .0 │ 0 │ .0 │ 0 │ ▆ │ │ │ │ baths │ 0.0 │ 1.0 │ 1.8 │ 0.74 │ 0.0 │ 1.0 │ 2.0 │ 6.0 │ █▃▁ │ │ │ │ square │ 18.0 │ 0.98 │ 1600.0 │ 660.0 │ 640.0 │ 1200.0 │ 1800. │ 6800.0 │ █▃▁ │ │ │ │ _feet │ │ │ │ │ │ │ 0 │ │ │ │ │ │ lot_si │ 190.0 │ 0.8 │ 0.72 │ 2.7 │ 0.029 │ 0.23 │ 0.37 │ 55.0 │ █ │ │ │ │ ze │ │ │ │ │ │ │ │ │ │ │ │ │ year_b │ 0.0 │ 1.0 │ 1900.0 │ 37.0 │ 1900. │ 1900.0 │ 2000. │ 2000.0 │ █▁▆▅ │ │ │ │ uilt │ │ │ │ │ 0 │ │ 0 │ │ ▃ │ │ │ │ year_s │ 0.0 │ 1.0 │ 2000.0 │ 3.1 │ 2000. │ 2000.0 │ 2000. │ 2000.0 │ ██▂▄▇ │ │ │ │ old │ │ │ │ │ 0 │ │ 0 │ │ ▄ │ │ │ │ month_ │ 0.0 │ 1.0 │ 6.8 │ 3.0 │ 1.0 │ 5.0 │ 9.0 │ 12.0 │ ▃▅██▆ │ │ │ │ sold │ │ │ │ │ │ │ │ │ ▅ │ │ │ │ day_so │ 0.0 │ 1.0 │ 16.0 │ 9.4 │ 1.0 │ 8.0 │ 25.0 │ 31.0 │ ▆▅▅▅▆ │ │ │ │ ld │ │ │ │ │ │ │ │ │ █ │ │ │ │ cost_p │ 0.0 │ 1.0 │ 83.0 │ 36.0 │ 0.0 │ 61.0 │ 110.0 │ 260.0 │ ▂█▇▂ │ │ │ │ er_sq_ │ │ │ │ │ │ │ │ │ │ │ │ │ ft │ │ │ │ │ │ │ │ │ │ │ │ │ orig_p │ 0.0 │ 1.0 │ 150000 │ 85000. │ 6000. │ 90000. │ 18000 │ 700000 │ ██▂ │ │ │ │ rice │ │ │ .0 │ 0 │ 0 │ 0 │ 0.0 │ .0 │ │ │ │ │ list_p │ 0.0 │ 1.0 │ 140000 │ 82000. │ 10000 │ 88000. │ 17000 │ 700000 │ █▇▁ │ │ │ │ rice │ │ │ .0 │ 0 │ .0 │ 0 │ 0.0 │ .0 │ │ │ │ │ sale_p │ 0.0 │ 1.0 │ 130000 │ 78000. │ 7000. │ 83000. │ 16000 │ 610000 │ ▇█▂▁ │ │ │ │ rice │ │ │ .0 │ 0 │ 0 │ 0 │ 0.0 │ .0 │ │ │ │ │ splp_p │ 0.0 │ 1.0 │ 94.0 │ 7.0 │ 46.0 │ 91.0 │ 98.0 │ 110.0 │ ▁█ │ │ │ │ ct │ │ │ │ │ │ │ │ │ │ │ │ └────────┴────────┴────────┴────────┴────────┴───────┴────────┴───────┴────────┴───────┘ │ │ category │ │ ┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┓ │ │ ┃ ┃ missing ┃ complete rate ┃ ordered ┃ unique ┃ │ │ ┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━┩ │ │ │ bedrooms │ 0.0 │ 1.0 │ False │ 9.0 │ │ │ │ class_ln_price │ 0.0 │ 1.0 │ True │ 4.0 │ │ │ └────────────────────────┴─────────────┴──────────────────────┴─────────────┴──────────┘ │ │ datetime │ │ ┏━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓ │ │ ┃ ┃ missing ┃ complete rate ┃ first ┃ last ┃ frequency ┃ │ │ ┡━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩ │ │ │ datetime │ 0.0 │ 1.0 │ 2005-04-02 │ 2015-04-09 │ None │ │ │ └────────────┴───────────┴──────────────────┴──────────────┴──────────────┴────────────┘ │ │ string │ │ ┏━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓ │ │ ┃ ┃ missing ┃ complete rate ┃ words per row ┃ total words ┃ │ │ ┡━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩ │ │ │ address │ 0.0 │ 1.0 │ 4.0 │ 3700.0 │ │ │ └─────────────┴────────────┴────────────────────┴────────────────────┴─────────────────┘ │ ╰─────────────────────────────────────────── End ───────────────────────────────────────────╯
Exercise
Apply exploratory data analysis to the ‘titanic’ dataset, which can be loaded from seaborn using df = sns.load_dataset("titantic") after you have imported seaborn using import seaborn as sns.
First run skimpy’s skim function. Then see what analysis you can do using pandas built-in exploratory methods.
The pandas-profiling package¶
The EDA we did using the built-in pandas functions was a bit limited and user-input heavy. The pandas-profiling library aims to automate the legwork of EDA for you. It generates ‘profile’ reports from a pandas DataFrame. For each column, many statistics are computed and then relayed in an interactive HTML report.
Let’s generate a report on our dataset using the minimal=True setting (the default settings produce a lot of computationally expensive extras):
from pandas_profiling import ProfileReport
profile = ProfileReport(
df, minimal=True, title="Profiling Report: Grinnell House Sales"
)
profile.to_notebook_iframe()
Pandas backend loaded 1.3.3
Numpy backend loaded 1.21.2
Pyspark backend NOT loaded
Python backend loaded
This is a full on report about everything in our dataset! We can see, for instance, that we have 17 numerical variables, 0 boolean, and 4 categorical (which includes string), and 1 datetime. The overview also tells us the number of observations, that we have 1.2% of values missing across all columns, and we have zero duplicates.
The warnings page shows where pandas-profiling really shines. It flags potential issues with the data that should be taken into account in any subsequent analysis. For example, although not relevant here, the report will say if there are very unbalanced classes in a low cardinality categorical variable.
Let’s explore a couple of the warnings that have been raised. Address has high cardinality; but that’s okay, we expect most addresses to be different because the turn-over of houses is not that frequent. We also see that there are a lot of missing values for LotSize.
Now let’s dig down into a specific variable, SalePrice. You can view more about it using the ‘Toggle details’ button. We get four extra pages of detail about this variable, in addition to the headline statistics: a comprehensive page of descriptive statistics, a page with a histogram of the variable, a page with a table of common values, and a page of extreme values.
Exercise
Run pandas-profiling on the ‘titanic’ dataset from the previous exercise.
The dataprep package¶
The dataprep package offers very similar functionality to pandas-profiling; it produces an in-depth report on the input data. There isn’t much difference between them in general apart from it seems like dataprep has slightly better support for string column types (check out the details of Address below) and is a little bit richer on visualisations (it has interactive plots that you can change the axes of-look at the interactions page below).
Do note, however, that some column types can create an error-which is why we’ll pass in all but the last column of our dataframe from the previous set of examples. In this case, having a bin-category column seems to be an issue.
from dataprep.eda import create_report
report = create_report(df.iloc[:, 2:-1])
from IPython.core.display import display
display(report)
Dataset Statistics
| Number of Variables | 14 |
|---|---|
| Number of Rows | 929 |
| Missing Cells | 206 |
| Missing Cells (%) | 1.6% |
| Duplicate Rows | 10 |
| Duplicate Rows (%) | 1.1% |
| Total Size in Memory | 103.3 KB |
| Average Row Size in Memory | 113.9 B |
| Variable Types |
|
Dataset Insights
| orig_price and list_price have similar distributions | Similar Distribution |
|---|---|
| square_feet has 18 (1.94%) missing values | Missing |
| lot_size has 188 (20.24%) missing values | Missing |
| baths is skewed | Skewed |
| lot_size is skewed | Skewed |
| year_built is skewed | Skewed |
| splp_pct is skewed | Skewed |
| Dataset has 10 (1.08%) duplicate rows | Duplicates |
| bedrooms has constant length 1 | Constant Length |
bedrooms
categorical
| Approximate Distinct Count | 9 |
|---|---|
| Approximate Unique (%) | 1.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory Size | 9.0 KB |
- The largest value (3) is over 2.03 times larger than the second largest value (4)
Length
| Mean | 1 |
|---|---|
| Standard Deviation | 0 |
| Median | 1 |
| Minimum | 1 |
| Maximum | 1 |
Sample
| 1st row | 2 |
|---|---|
| 2nd row | 3 |
| 3rd row | 4 |
| 4th row | 3 |
| 5th row | 3 |
Letter
| Count | 0 |
|---|---|
| Lowercase Letter | 0 |
| Space Separator | 0 |
| Uppercase Letter | 0 |
| Dash Punctuation | 0 |
| Decimal Number | 929 |
- The top 2 categories (3, 4) take over 50.0%
- The largest value (3) is over 2.03 times larger than the second largest value (4)
- bedrooms has words of constant length
baths
numerical
| Approximate Distinct Count | 16 |
|---|---|
| Approximate Unique (%) | 1.7% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.5 KB |
| Mean | 1.7793 |
| Minimum | 0 |
| Maximum | 6 |
| Zeros | 12 |
| Zeros (%) | 1.3% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- baths is skewed right (γ1 = 0.9974)
Quantile Statistics
| Minimum | 0 |
|---|---|
| 5-th Percentile | 1 |
| Q1 | 1 |
| Median | 1.75 |
| Q3 | 2 |
| 95-th Percentile | 3 |
| Maximum | 6 |
| Range | 6 |
| IQR | 1 |
Descriptive Statistics
| Mean | 1.7793 |
|---|---|
| Standard Deviation | 0.7432 |
| Variance | 0.5523 |
| Sum | 1653 |
| Skewness | 0.9974 |
| Kurtosis | 2.6003 |
| Coefficient of Variation | 0.4177 |
- baths is not normally distributed (p-value 1.2977924595984369e-12)
- baths has 9 outliers
square_feet
numerical
| Approximate Distinct Count | 526 |
|---|---|
| Approximate Unique (%) | 57.7% |
| Missing | 18 |
| Missing (%) | 1.9% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.2 KB |
| Mean | 1583.2151 |
| Minimum | 640 |
| Maximum | 6815 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- square_feet is skewed right (γ1 = 2.1205)
Quantile Statistics
| Minimum | 640 |
|---|---|
| 5-th Percentile | 884 |
| Q1 | 1150 |
| Median | 1440 |
| Q3 | 1833 |
| 95-th Percentile | 2855 |
| Maximum | 6815 |
| Range | 6175 |
| IQR | 683 |
Descriptive Statistics
| Mean | 1583.2151 |
|---|---|
| Standard Deviation | 655.665 |
| Variance | 429896.5471 |
| Sum | 1.4423e+06 |
| Skewness | 2.1205 |
| Kurtosis | 8.0646 |
| Coefficient of Variation | 0.4141 |
- square_feet is not normally distributed (p-value 7.391410506461778e-06)
- square_feet has 45 outliers
lot_size
numerical
| Approximate Distinct Count | 398 |
|---|---|
| Approximate Unique (%) | 53.7% |
| Missing | 188 |
| Missing (%) | 20.2% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 11.6 KB |
| Mean | 0.7235 |
| Minimum | 0.02893 |
| Maximum | 55 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- lot_size is skewed right (γ1 = 13.0898)
Quantile Statistics
| Minimum | 0.02893 |
|---|---|
| 5-th Percentile | 0.14 |
| Q1 | 0.2339 |
| Median | 0.2841 |
| Q3 | 0.3702 |
| 95-th Percentile | 1.82 |
| Maximum | 55 |
| Range | 54.9711 |
| IQR | 0.1363 |
Descriptive Statistics
| Mean | 0.7235 |
|---|---|
| Standard Deviation | 2.7257 |
| Variance | 7.4294 |
| Sum | 536.0862 |
| Skewness | 13.0898 |
| Kurtosis | 224.9221 |
| Coefficient of Variation | 3.7676 |
- lot_size is not normally distributed (p-value 4.578328328239603e-25)
- lot_size has 93 outliers
year_built
numerical
| Approximate Distinct Count | 101 |
|---|---|
| Approximate Unique (%) | 10.9% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.5 KB |
| Mean | 1945.9031 |
| Minimum | 1870 |
| Maximum | 2013 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- year_built is skewed left (γ1 = -0.1125)
Quantile Statistics
| Minimum | 1870 |
|---|---|
| 5-th Percentile | 1900 |
| Q1 | 1900 |
| Median | 1956 |
| Q3 | 1973 |
| 95-th Percentile | 2001.6 |
| Maximum | 2013 |
| Range | 143 |
| IQR | 73 |
Descriptive Statistics
| Mean | 1945.9031 |
|---|---|
| Standard Deviation | 37.2099 |
| Variance | 1384.5768 |
| Sum | 1.8077e+06 |
| Skewness | -0.1125 |
| Kurtosis | -1.3947 |
| Coefficient of Variation | 0.01912 |
- year_built is not normally distributed (p-value 7.969970487818614e-22)
year_sold
numerical
| Approximate Distinct Count | 11 |
|---|---|
| Approximate Unique (%) | 1.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.5 KB |
| Mean | 2009.366 |
| Minimum | 2005 |
| Maximum | 2015 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- year_sold is skewed right (γ1 = 0.1688)
Quantile Statistics
| Minimum | 2005 |
|---|---|
| 5-th Percentile | 2005 |
| Q1 | 2007 |
| Median | 2009 |
| Q3 | 2012 |
| 95-th Percentile | 2014 |
| Maximum | 2015 |
| Range | 10 |
| IQR | 5 |
Descriptive Statistics
| Mean | 2009.366 |
|---|---|
| Standard Deviation | 3.1241 |
| Variance | 9.7603 |
| Sum | 1.8667e+06 |
| Skewness | 0.1688 |
| Kurtosis | -1.3769 |
| Coefficient of Variation | 0.001555 |
- year_sold is not normally distributed (p-value 1.3795096597536255e-05)
month_sold
numerical
| Approximate Distinct Count | 12 |
|---|---|
| Approximate Unique (%) | 1.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.5 KB |
| Mean | 6.831 |
| Minimum | 1 |
| Maximum | 12 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- month_sold is skewed left (γ1 = -0.0799)
Quantile Statistics
| Minimum | 1 |
|---|---|
| 5-th Percentile | 2 |
| Q1 | 5 |
| Median | 7 |
| Q3 | 9 |
| 95-th Percentile | 12 |
| Maximum | 12 |
| Range | 11 |
| IQR | 4 |
Descriptive Statistics
| Mean | 6.831 |
|---|---|
| Standard Deviation | 3.0358 |
| Variance | 9.216 |
| Sum | 6346 |
| Skewness | -0.07994 |
| Kurtosis | -0.9231 |
| Coefficient of Variation | 0.4444 |
- month_sold is not normally distributed (p-value 0.000131284058198017)
day_sold
numerical
| Approximate Distinct Count | 31 |
|---|---|
| Approximate Unique (%) | 3.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.5 KB |
| Mean | 16.226 |
| Minimum | 1 |
| Maximum | 31 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- day_sold is skewed left (γ1 = -0.0919)
Quantile Statistics
| Minimum | 1 |
|---|---|
| 5-th Percentile | 1 |
| Q1 | 8 |
| Median | 16 |
| Q3 | 25 |
| 95-th Percentile | 30 |
| Maximum | 31 |
| Range | 30 |
| IQR | 17 |
Descriptive Statistics
| Mean | 16.226 |
|---|---|
| Standard Deviation | 9.3968 |
| Variance | 88.3001 |
| Sum | 15074 |
| Skewness | -0.09188 |
| Kurtosis | -1.2452 |
| Coefficient of Variation | 0.5791 |
cost_per_sq_ft
numerical
| Approximate Distinct Count | 856 |
|---|---|
| Approximate Unique (%) | 92.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.5 KB |
| Mean | 83.2916 |
| Minimum | 0 |
| Maximum | 258.79 |
| Zeros | 18 |
| Zeros (%) | 1.9% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- cost_per_sq_ft is skewed right (γ1 = 0.1412)
Quantile Statistics
| Minimum | 0 |
|---|---|
| 5-th Percentile | 21.934 |
| Q1 | 60.82 |
| Median | 82.88 |
| Q3 | 105.77 |
| 95-th Percentile | 143.968 |
| Maximum | 258.79 |
| Range | 258.79 |
| IQR | 44.95 |
Descriptive Statistics
| Mean | 83.2916 |
|---|---|
| Standard Deviation | 35.7622 |
| Variance | 1278.9379 |
| Sum | 77377.86 |
| Skewness | 0.1412 |
| Kurtosis | 0.4294 |
| Coefficient of Variation | 0.4294 |
- cost_per_sq_ft has 4 outliers
orig_price
numerical
| Approximate Distinct Count | 385 |
|---|---|
| Approximate Unique (%) | 41.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.5 KB |
| Mean | 146047.084 |
| Minimum | 5990 |
| Maximum | 695000 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- orig_price is skewed right (γ1 = 1.8714)
Quantile Statistics
| Minimum | 5990 |
|---|---|
| 5-th Percentile | 49900 |
| Q1 | 89900 |
| Median | 129900 |
| Q3 | 179000 |
| 95-th Percentile | 313000 |
| Maximum | 695000 |
| Range | 689010 |
| IQR | 89100 |
Descriptive Statistics
| Mean | 146047.084 |
|---|---|
| Standard Deviation | 85174.1094 |
| Variance | 7.2546e+09 |
| Sum | 1.3568e+08 |
| Skewness | 1.8714 |
| Kurtosis | 5.8245 |
| Coefficient of Variation | 0.5832 |
- orig_price is not normally distributed (p-value 7.121042683429925e-06)
- orig_price has 47 outliers
list_price
numerical
| Approximate Distinct Count | 373 |
|---|---|
| Approximate Unique (%) | 40.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.5 KB |
| Mean | 141399.1389 |
| Minimum | 10500 |
| Maximum | 695000 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- list_price is skewed right (γ1 = 1.8469)
Quantile Statistics
| Minimum | 10500 |
|---|---|
| 5-th Percentile | 46360 |
| Q1 | 88000 |
| Median | 125000 |
| Q3 | 172000 |
| 95-th Percentile | 292800 |
| Maximum | 695000 |
| Range | 684500 |
| IQR | 84000 |
Descriptive Statistics
| Mean | 141399.1389 |
|---|---|
| Standard Deviation | 82228.2018 |
| Variance | 6.7615e+09 |
| Sum | 1.3136e+08 |
| Skewness | 1.8469 |
| Kurtosis | 5.8706 |
| Coefficient of Variation | 0.5815 |
- list_price is not normally distributed (p-value 6.017772964822792e-06)
- list_price has 44 outliers
sale_price
numerical
| Approximate Distinct Count | 405 |
|---|---|
| Approximate Unique (%) | 43.6% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.5 KB |
| Mean | 133203.7201 |
| Minimum | 7000 |
| Maximum | 606000 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- sale_price is skewed right (γ1 = 1.6968)
Quantile Statistics
| Minimum | 7000 |
|---|---|
| 5-th Percentile | 37000 |
| Q1 | 83000 |
| Median | 119340 |
| Q3 | 162500 |
| 95-th Percentile | 281900 |
| Maximum | 606000 |
| Range | 599000 |
| IQR | 79500 |
Descriptive Statistics
| Mean | 133203.7201 |
|---|---|
| Standard Deviation | 77959.6247 |
| Variance | 6.0777e+09 |
| Sum | 1.2375e+08 |
| Skewness | 1.6968 |
| Kurtosis | 4.8016 |
| Coefficient of Variation | 0.5853 |
- sale_price is not normally distributed (p-value 2.432835900761821e-05)
- sale_price has 47 outliers
splp_pct
numerical
| Approximate Distinct Count | 593 |
|---|---|
| Approximate Unique (%) | 63.8% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Memory Size | 14.5 KB |
| Mean | 93.564 |
| Minimum | 46.5 |
| Maximum | 111.02 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negatives | 0 |
| Negatives (%) | 0.0% |
- splp_pct is skewed left (γ1 = -2.5043)
Quantile Statistics
| Minimum | 46.5 |
|---|---|
| 5-th Percentile | 82.13 |
| Q1 | 91.32 |
| Median | 94.92 |
| Q3 | 97.56 |
| 95-th Percentile | 100 |
| Maximum | 111.02 |
| Range | 64.52 |
| IQR | 6.24 |
Descriptive Statistics
| Mean | 93.564 |
|---|---|
| Standard Deviation | 6.9538 |
| Variance | 48.3548 |
| Sum | 86920.95 |
| Skewness | -2.5043 |
| Kurtosis | 10.8721 |
| Coefficient of Variation | 0.07432 |
- splp_pct is not normally distributed (p-value 8.666744393185514e-07)
- splp_pct has 51 outliers
Review¶
That’s it for this very brief tour of exploratory data analysis with code. You should know now how to do simple analysis of pandas dataframes using techniques from other chapters to create summary statistics, using styling of tables of summary statistics, using the quick visualisation options that are built-in to pandas, and using the skimpy package. You should also know where to go for more in-depth, heavy duty exploratory data analysis reports, namely pandas-profiling and dataprep.